- When to use the parallel package in R
- How to parallelize apply functions
- How to parallelize a process that generates random numbers
February 21, 2015
The parallel package merges multicore and snow: multicore functionality on Unix-like systems, snow functionality on Windows (and snow functionality to execute on a cluster for both systems). Make the desired number of cores available to R:
require(parallel)
# Determine number of CPU cores in your machine
nCores = detectCores()
# Create cluster with desired number of cores
cl = makeCluster(nCores)
foreach parallelizes for loops; parallel does the same for apply functions.

input = 1:3
sapply(input, sqrt)
## [1] 1.000000 1.414214 1.732051
lapply(input, sqrt)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
Consider this simple example:
input = 1:100
testFun = function(i){
mu = mean(runif(1e+06, 0, i))
return(mu)
}
(There is something wrong here… we'll come back to this.)
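To see the problem, note that each call to testFun draws fresh random numbers, so repeated (and parallel) runs are not reproducible unless the RNG is controlled. A minimal sketch of the symptom, using a smaller sample size (1e+04 rather than 1e+06) only to keep it quick:

```r
# Each run draws new random numbers, so two identical calls almost
# surely disagree unless the RNG state is controlled.
testFunSmall = function(i){ mean(runif(1e+04, 0, i)) }
res1 = sapply(1:5, testFunSmall)
res2 = sapply(1:5, testFunSmall)
identical(res1, res2)  # almost surely FALSE
```

With serial code, set.seed() would fix this; in parallel, each worker needs its own reproducible stream, which is what clusterSetRNGStream() provides.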
How would we do this with foreach?
library(foreach)
library(doParallel)  # provides a backend for %dopar%
registerDoParallel(cl)
system.time(
  mu.foreach <- foreach(i = 1:100,
                        .combine = "c") %dopar% {
    testFun(i)
  }
)
Now let's try it with sapply and parSapply
system.time( sapply(input, testFun) )
##    user  system elapsed
##   3.790   0.112   3.911
system.time( parSapply(cl, input, testFun) )
##    user  system elapsed
##   0.003   0.000   1.940
Now let's try it with lapply and parLapply
system.time( lapply(input, testFun) )
##    user  system elapsed
##   5.809   0.112   5.931
system.time( parLapply(cl, input, testFun) )
##    user  system elapsed
##   0.001   0.000   1.900
We could also use mclapply
# not available on Windows
system.time(
  mclapply(input, testFun, mc.cores = nCores)
)
##    user  system elapsed
##   5.934   0.208   2.268
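On Unix-like systems, mclapply results can also be made reproducible by switching to the L'Ecuyer-CMRG generator before seeding, so each forked worker gets its own stream. A sketch (assumptions: a Unix-like machine with at least 2 cores; a smaller sample size than above, for speed):

```r
library(parallel)
RNGkind("L'Ecuyer-CMRG")  # parallel-safe RNG with independent streams
set.seed(2015)
r1 = mclapply(1:4, function(i) mean(runif(1e+04, 0, i)), mc.cores = 2)
set.seed(2015)
r2 = mclapply(1:4, function(i) mean(runif(1e+04, 0, i)), mc.cores = 2)
identical(r1, r2)  # TRUE: same seed, same streams
```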
base = 2
clusterExport(cl, "base")
parLapply(cl, 1:3, function(x){base^x})
## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 8
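Besides exporting variables with clusterExport, note that each worker starts with a fresh R session, so packages must be loaded on every worker as well; clusterEvalQ evaluates an expression on each node. A short sketch (assumption: a 2-core cluster created just for this illustration, using the base splines package):

```r
library(parallel)
cl2 = makeCluster(2)
# Load a package on every worker node
clusterEvalQ(cl2, library(splines))
# Workers can now use splines functions, e.g. bs();
# a degree-d B-spline basis with no interior knots has d columns
res = parLapply(cl2, 1:3, function(x) ncol(bs(1:10, degree = x)))
stopCluster(cl2)
```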
Let's go back and fix our first example:
input = 1:100
testFun = function(i){ mean(runif(1e+06, 0, i)) }
clusterSetRNGStream(cl, iseed=2015)
res1 = parSapply(cl, input, testFun)
clusterSetRNGStream(cl, iseed=2015)
res2 = parSapply(cl, input, testFun)
identical(res1, res2)
## [1] TRUE
run1 = function(...){
require(boot); require(splines)
load(url("http://www.stat.colostate.edu/~scharfh/CSP_parallel/data/arrivals_subset.RData"))
bikePred = function(data, indices){
d = data[indices,]
big.glm <- glm(arrivals ~
bs(hour, degree = 4)*weekend
+ bs(hour, degree = 4)*as.factor(id),
data = d, family = "poisson")
mynewdat = data.frame(weekend=FALSE, id=293, hour=17)
return(predict(object=big.glm, newdata=mynewdat, type="response"))
}
boot(data=arrivals.sub, statistic=bikePred, R=250)
}
set.seed(2015)
system.time(
  bike.boot <- do.call(c, lapply(seq_len(nCores), run1))
)
##    user  system elapsed
## 170.026   1.799 173.381
clusterSetRNGStream(cl, iseed=2015)
system.time(
  bike.boot2 <- do.call(c, parLapply(cl, seq_len(nCores), run1))
)
##    user  system elapsed
##   0.007   0.002  70.793
boot.ci(bike.boot2, type="perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bike.boot2, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   (21.54, 24.83 )
## Calculations and Intervals on Original Scale
Since the boot package has built-in parallel support, we could also simply run the following:
nBoot = nCores*250
# Note: this assumes arrivals.sub and bikePred (defined inside run1 above)
# are available in the global environment
set.seed(2015, kind = "L'Ecuyer-CMRG")
system.time(
  bike.boot3 <- boot(data=arrivals.sub, statistic=bikePred, R=nBoot,
                     parallel="multicore", ncpus=4)
)
##    user  system elapsed
## 181.781   2.518  64.943
Don't forget to stop your cluster when you are done using it.
stopCluster(cl)
Reference: Package ‘parallel’ vignette (vignette("parallel"))